We see one column in which every value is NaN. We will drop this column. All other columns have no null values.
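A minimal sketch of this cleanup step on a toy frame (the column names here are illustrative, not the actual data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset; column names are illustrative
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Unnamed: 21": [np.nan, np.nan, np.nan],  # the all-NaN column
    "Customer_Age": [45, 49, 51],
})

# Drop every column whose values are all NaN
df = df.dropna(axis=1, how="all")
```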

There are 6 columns with data type 'object'. We will convert them into categorical variables for model building.
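One way to do the conversion, sketched on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M"],                 # object dtype
    "Education_Level": ["Graduate", "Unknown", "High School"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

# Convert every object-dtype column to pandas' categorical dtype
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype("category")
```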

Observations:

  1. Some records have an unknown Education level.
  2. Some records have an unknown Marital status.
  3. Some records have an unknown Income category.

Observations:

CLIENTNUM is a unique identifier, so it adds no meaning to the analysis and can be dropped.
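Dropping the identifier column might look like this (toy data standing in for the real frame):

```python
import pandas as pd

df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],  # unique identifier, no signal
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"],
})

# A unique ID carries no predictive information, so drop it
df = df.drop(columns=["CLIENTNUM"])
```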

Perform an Exploratory Data Analysis on the data

Bivariate Analysis

Observations:

Missing-Value Treatment

Split the data
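A sketch of the split with scikit-learn's `train_test_split` (the 70/30 ratio, random state, and synthetic data are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y = pd.Series(rng.integers(0, 2, size=100))

# Stratify on the target so both splits keep the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
```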

Imputing Missing Values
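One leakage-safe way to impute the missing categories, fitting on the training split only (the column name and strategy here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({"Education_Level": ["Graduate", np.nan, "Graduate", "High School"]})
X_test = pd.DataFrame({"Education_Level": [np.nan, "High School"]})

# Fit the imputer on the training split only to avoid data leakage,
# filling missing categories with the most frequent training value
imputer = SimpleImputer(strategy="most_frequent")
X_train[["Education_Level"]] = imputer.fit_transform(X_train[["Education_Level"]])
X_test[["Education_Level"]] = imputer.transform(X_test[["Education_Level"]])
```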

Encoding categorical variables
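A minimal one-hot encoding sketch with pandas (column names are illustrative):

```python
import pandas as pd

X_train = pd.DataFrame({"Gender": ["M", "F", "F"], "Credit_Limit": [1.0, 2.0, 3.0]})

# One-hot encode, dropping the first level of each category to
# avoid redundant (perfectly collinear) columns
X_train = pd.get_dummies(X_train, drop_first=True)
```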

Before building the models, let's create functions to calculate different metrics (Accuracy, Recall, and Precision) and to plot the confusion matrix.
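A sketch of such helpers, assuming scikit-learn's metrics (the function names are my own):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

def model_scores(model, X, y):
    """Return accuracy, recall, and precision for a fitted model."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred),
        "Precision": precision_score(y, pred),
    }

def make_confusion_matrix(model, X, y):
    """Return the confusion matrix for a fitted model
    (plot it with e.g. a seaborn heatmap)."""
    return confusion_matrix(y, model.predict(X))
```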

Model building - Logistic Regression
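A baseline logistic regression fit might look like this (synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

# Baseline linear model; raise max_iter so the solver converges
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
test_acc = logreg.score(X_test, y_test)
```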

SMOTE to upsample the smaller class

Downsampling the larger class
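Downsampling can be done with scikit-learn's `resample` utility (toy data; the class sizes are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"target": [0] * 90 + [1] * 10, "f": range(100)})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Randomly drop majority rows until both classes are the same size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=1)
df_balanced = pd.concat([majority_down, minority])
```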

Model building - Bagging and Boosting

Bagging Classifier

Random Forest Classifier

Decision Tree Model
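Fitting the three tree-based models for a first comparison might look like this (default hyperparameters; synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Fit all three with default settings for a first comparison;
# unconstrained trees will typically memorize the training data
models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X, y)
```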

Observations:

  1. All three models overfit the train data.
  2. The decision tree performs worst in terms of accuracy, but the random forest has the lowest recall.
  3. Of these three models, the bagging classifier performs best overall.

Reducing overfitting

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier
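The boosting models share scikit-learn's fit/predict interface; a minimal sketch on synthetic data (XGBoost's `XGBClassifier` can be swapped in the same way, assuming the xgboost package is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

# Sequential ensembles: each stage focuses on the errors of the previous ones
ada = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

# xgboost.XGBClassifier exposes the same fit/predict interface and can be
# dropped in here in the same way (assuming the xgboost package)
```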

Observations:

  1. The XGBoost model overfits the train data.
  2. XGBoost's recall is not significantly improved compared to AdaBoost and gradient boosting.
  3. The gradient boosting classifier gives the best performance among the boosting classifiers.
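One standard way to tune these models is cross-validated grid search; a sketch with `GridSearchCV` (the grid itself is illustrative, and recall is used for scoring since identifying churners matters most):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Constrain tree depth and leaf size via an exhaustive,
# cross-validated search over a small hyperparameter grid
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, 7], "min_samples_leaf": [5, 10]},
    scoring="recall",
    cv=3,
)
grid.fit(X, y)
best_tree = grid.best_estimator_
```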

Bagging Classifier

Observations:

  1. The model overfits the data.
  2. Test recall is low compared to the other models seen above.
  3. Time taken: 1 min 56 sec.

Random Forest

Observations:

  1. The model overfits the data.
  2. Test recall is low compared to the other models seen above.
  3. Time taken: 9 min 28 sec, an increase over the previous model.

Decision Tree

Observations:

  1. Overfitting is slightly reduced.
  2. Test recall has also decreased compared to the other models seen above.
  3. Time taken: just 32.5 sec.

ADA Boost

Observations:

  1. The model overfits the data.
  2. Test recall is slightly improved here.
  3. Time taken: 23 min 27 sec.

Gradient Boost

Observations:

  1. The model overfits the data.
  2. Test recall is slightly improved here.
  3. Time taken: 8 min 35 sec.

XG Boost

Observations:

  1. Overfitting is greatly reduced.
  2. Test recall has also increased compared to the other models seen above.
  3. Time taken: 7 hr 31 min 35 sec, the highest among all the models above. We will see whether random search reduces this.
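Random search samples a fixed number of hyperparameter combinations instead of exhausting a full grid, which is why it can cut tuning time so sharply; a sketch with `RandomizedSearchCV` (the distributions here are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Sample n_iter random combinations from the given distributions,
# trading exhaustiveness for a large reduction in tuning time
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={"n_estimators": randint(50, 200), "max_depth": randint(3, 10)},
    n_iter=5,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X, y)
```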

Bagging Classifier

Observations:

  1. The model overfits the data.
  2. Test recall is slightly improved compared to the grid search bagging classifier.
  3. Time taken: 8 min 35 sec.

Random Forest

Observations:

  1. The model overfits the data.
  2. Test recall is low compared to the other models seen above.
  3. Time taken: 1 min 16 sec, significantly reduced from the previous model.

Decision Tree

Observations:

  1. Overfitting is greatly reduced.
  2. Test recall is very low compared to the other models seen above.
  3. Time taken: just 1.48 sec, the quickest model yet.

ADA Boost

Observations:

  1. This model is clearly overfitting.
  2. Test recall is also low compared to the other models seen above.
  3. Time taken: just 24.4 sec.

Gradient Boost

Observations:

  1. This model overfits the data.
  2. Test recall is comparable with the other models seen above.
  3. Time taken: just 39.1 sec.

XG Boost

Observations:

  1. This model generalizes well to the data.
  2. Test recall is the highest of all the models seen above.
  3. Time taken: just 14.5 sec. Considering all factors, this model performs best.

Comparing all models

Observations:

Business Insight and Recommendation:

Model evaluation criterion:

  1. For the random search tuned XGBoost model, the features "Total Revolving Balance", "Total Transaction Count", "Total Transaction Amount", "Total_Ct_Chng_Q4_Q1", and "Months_Inactive_12_mon" appear to be the top 5 features influencing the model's output. The bank can choose to target those specific customers to maximize the sales of the package.

  2. After achieving the desired accuracy, we can deploy the model for practical use. The bank can then predict which customers will churn (give up the card) and which will not.